Improved iterative Bayesian unfolding 

G. D'Agostini 
Universita "La Sapienza" and INFN, Rome, Italy 



(giulio.dagostini@romal.infn.it, http : //www . romal . inf n. it/~dagos 



Abstract 

This paper reviews the basic ideas behind a Bayesian unfolding 
published some years ago and improves their implementation. In par- 
ticular, uncertainties are now treated at all levels by probability density 
functions and their propagation is performed by Monte Carlo integra- 
tion. Thus, small numbers are better handled and the final uncertainty 
does not rely on the assumption of normality. Theoretical and practical 
issues concerning the iterative use of the algorithm are also discussed. 
The new program, implemented in the R language, is freely available, 
together with sample scripts to play with toy models. 



1 Introduction 

Physicists like to think of 'true' distributions of physics quantities, i.e those 
distributions (graphically represented by histograms and here also referred 
as 'spectra') one would observe under idealized conditions that seldom - 
never, strictly speaking - happen in real live (ideal detector, no physical or 
instrumental background). The 'observed' distribution is then considered as 
a 'noisy distortion' of the true one. An important task of the experimental 
method is therefore to infer the true distribution from the observed one, i.e. 
to correct the observed spectrum for distortion and noise. This can be done 
by different methods that follow different approaches. Since this is not a 
review paper, I just outline the two different classes of strategies and then 
focus on the specific issues of this work. 

In a first kind of approach a mathematical function for the true distribu- 
tion is assumed (together with other functions to model the noise) and the 
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task becomes that of estimating the free parameters of the function(s). In- 
deed, we fall in the so called domain of parametric inference ('fits'), usually 
associated to names like 'least-squares' or 'maximum likelihood' 

In parametric inference all information contained in the observed spec- 
trum is, to say, 'distilled' into the model parameters. Parametric inference, 
especially when the conditions to use least-squares methods hold, is usually 
the best and fastest way to proceed, if we have good reasons to believe the 
hypothesized family of functions. 

However, sometimes we wish to interpret the data as little as possible 
and just public 'something equivalent' to an experimental distribution, with 
the bin contents fluctuating according to an underlying multinomial distri- 
bution, but having possibly got rid of physical and instrumental distortions, 
as well as of background. In particle physics this second approach goes under 
the name of unfolding (or deconvolution, or restoration). 

Several years ago a simple algorithm based on Bayes' theorem was pre- 
sented [2] with which it was possible to achieve rather 'good' results ('good' 
compared with the difficulty of the task). The main improvements pre- 
sented here concern the handling of small numbers and the evaluation of 
the uncertainty on the unfolded distribution, while the guiding ideas and 
the basic assumptions are substantially unchanged. To be more clear, the 
algorithm of Ref. [2] was relying on the underlying hypotheses of normality 
and 'small relative errors': 'best estimates' were provided, with uncertainty 
calculated from standard 'error propagation' formulasll The new algorithm 
handles better small numbers, in the way it will be described in Sec. 13.21 and 
performs the propagation of uncertainty by sampling, i.e. by Monte Carlo 
(MC) integration. The algorithm has been implemented in a R language [4] 
code available on the author's web page [5]. 

The paper is organized in the following way. Section [2] gives a short 
introduction to Bayesian inference and to the specific application subject of 
this paper. Then the algorithm of Ref. [2] is reminded and the improvements 
are presented. The issue of iterative use of the algorithm is also discussed, 
although the program now gives also the possibility to use priors, provided 

1 However one can show that it is preferable to base the data analysis on more solid prob- 
abilistic ground, from which the mentioned methods might be derived (see e.g. Ref. pQ) 
under special conditions which fortunately hold in the large majority of practical cases 
(but it is better one knows when the conditions of validity hold and what to do when they 
do not!). 

2 It is worth remembering that these formulas rely on an hypothesis of linearity between 
input and output quantities, an hypothesis that holds approximately, in non-strictly linear 
problems, if the relative uncertainties of the input quantities are small enough, such that 
the dependence input— ^output can be locally linearized [3J. 
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by the user, over all possible true spectra. This option should avoid the 
need of iterations (but I anticipate that it might be not that easy to model 
such priors and then the iterative strategy remains a pragmatic solution). 
Finally, some results on toy models are presented. 

Some technical issues concerning binomial and Dirichlet distribution, 
including the use of the latter as prior conjugate of the former, are reminded 
in Appendix A. A second appendix is dedicated to the handling of the zeros 
occurring in the evaluation of the smearing matrix (a cause that produces 
no event in some effect bins, as it usually happens, depending also on the 
statistics of the Monte Carlo simulation). As it happens with intermediate 
smoothing (or other kinds of regularization) , this is a suggestion of how the 
problem can be approached, whose solution is delegated to the user, who is 
supposed to know the physics case under investigation. 

2 Bayesian inference and Bayesian unfolding: 
from first principles to real life 

The so called Bayesian inference is a way to learn about physical quantities 
and their relationships (all things we cannot directly see) from experimental 
data (what we can actually observe with our senses, usually mediated by 
more or less sophisticated detectors) using probability theory. This game is 
conceptually rather simple, proviso we accept that the intuitive meaning of 
probability - a scale to rank our beliefs that several events might happen, 
or that several hypotheses might be true - is suitable in scientific reasoning 
(see Refs. [B] and [7] and references therein for an introduction). 

The starting point of a Bayesian inference is to build up a model for the 
deterministic ("B follows from A") and the probabilistic ("B might follow 
from A" ) connections that relate the several entities that enter the problem. 
This model has the interesting graphical representation of a network, usually 
called Bayesian network or belief network (to get an idea of their meaning 
and their application Google these keywords, or browse the Wikipedia). For 
example, Fig.[TJ taken from Ref. [I], shows a Bayesian network to model two 
physical quantities, \x x and /j, y , connected to each other with a 'law', whose 
parameters are denoted by 6. In the very elementary case depicted in the 
left hand network of Fig. [I] the solution to the problem can be rather simple, 
under some assumptions that, fortunately, hold very often in routine cases. 
But, in general, the solution is not that simple. Nevertheless, as stressed 
in Ref. [I], after one has built the graphical model, one is often more than 
half the way to the solution, thanks to the great progresses recently made 



3 




[for each i 




b) 



Figure 1: Bayesian network describing a fit model {2|. x % an d Hi are the 
experimental observations, related to \i Xi and fj, yi by experimental errors. 
Instead, a deterministic 'law' connects the 'true' values \i yi to fi Xi via the 
model parameters 9 (solid arrows stand for probabilistic links, dashed for 
deterministic). Network a) describes a simple model with errors only on the 
ordinate. Network b ) takes into account errors on both axes, extra variability 
of the data points around the believed physical law and systematic effects. 



in Bayesian network computing. 

The essence of the Bayesian unfolding of Ref. [2], that is the starting 
point also of this new version, is to make the problem discrete and to treat 
the 'cause' bins as independent degrees of freedom, i.e. without constraints 
among each other H For this reason the algorithm can virtually handle any 
kind of 'smearing' and it is easily extensible, at least in principle and with 
the only limit due to computer power, to multidimensional problems. In 
fact, the core of the algorithm only knows about 'cause-cells' and 'effect- 
cells' but it does not know the location of the cells in the configuration 

3 An important drawback of this feature will be discussed in section [4] 
4 'Bin' and 'cell' are used here as synonyms, although 'cell' refers to a region on the 
configuration space and 'bin' to histograms representing them. 
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Figure 2: Probabilistic links from causes to effects. The node indicated by 
T ('trash') stand for the inefficiency bin and corresponds to E nE+ \ 



space of the problem. For the same reason, the treatment of background, 
and even of several independent sources of background, can be easily em- 
bodied in the algorithm by just adding extra cause-cells, one cell per source 
of background. As a by-product, the algorithm also provides the number 
of events to be attributed to each source of background. (It is worth re- 
membering that background might have an interesting physical meaning, 
and thus the estimation of the level 'noise' might provides indeed a physics 
measurement, as in the analysis of Ref. [8].) 

Given the discretization of the problem, the Bayesian network relating 
causes and effects is that shown in Fig. [21 where we use the same notation of 
Ref. [2], with the addition of the effect bin T ('trash'), equivalent to E nE+ i, 
to describe inefficiency (the reason to introduce this extra bin will become 
clear later). 

Rephrasing the problem in probabilistic terms, the purpose of the un- 
folding is to find the 'true' number of events in each cause bin [#(Cj) in 
Fig. El indicated by x(C{) in the text], given the observed spectrum and 
assuming some knowledge about the smearing. 

Since the links cause— >-effects have a probabilistic nature, it follows that 
also the links effect— ?>causes will be probabilistic, and therefore it will be 
uncertain the number of events to be attributed to the cause-cells. We can 
only attempt to rank in probability all possible spectra that might have 
caused the observed one. In other words, the realistic goal of our analysis is 
not to determine the the true spectrum, but rather to assess 



P(x c \x E , A, /) 
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Figure 3: From the 'true' distribution (numbers of events in the cause-bins) to the 
observed distribution (numbers of events in the effect-bins). The number of events 
) is indicated with x() in the text. 



where H 



xq = {x{C\), x(C*2), . . . , x(C nc )} is the number of events in each bin 
of the true distribution, i.e. a true spectrum^ 

xe = {x(Ei), x(E2), . . . , x(E nE )} is the observed spectrum. 

A stands for the smearing matrix, whose elements Xji (see remarks in 
footnote 4 concerning notation) are defined in probabilistic terms as 



Ajj = P(Ej | Cj, I) . 



5 In Ref. [2] xc, xe and A were respectively indicated by n c , n E and S. As in Fig. [3] 
and in Ref. [2], the symbols i and j are used to index causes and effects, respectively, no 
matter if this choice leads to the unusual convention of indexing the A rows by j and the 
columns by i, as in Eq. ([2]). Note that, when we refers to MC simulations to infer the 
smearing matrix, xe, we also need include the trash bin, as it will be reminded at the 
proper place. Later on [see Eq. |(9|] it will be convenient to name Xi the columns of the 
matrix A. In summary 



A 



/ PiEkia, i) 

P{E 2 \d, I) 



P{Ei\C 2 , I) 
P{E 2 \C 2 , I) 



P(£i|C nc ,J) \ 
P(E 2 \C nc , I) 



\P(E nE+1 \C u I) P(E nE+1 \C 2 , I) 
(Ai, A2, . . . , A„ c ) . 



P(E nE+1 \C nc , I)) 



For the use of the indefinite article in conjunction with 'true values', see the ISO 
"Guide to the expression of uncertainty in measurement" ^[9], according to which a true 
value is "a value compatible with the definition of a given particular quantity. " 
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The knowledge of A comes usually from MC simulation and it is there- 
fore affected by an uncertainty, described, in general terms, by a pdf 
/(A /). 

• / stands for the state of information under which the analysis is per- 
formed (this underlying condition is often implicit in the pdf's). 

Once we have stated clearly and in probabilistic terms our question ( "what 
is P(xc \xe, I) ?"), probability theory provides us the answer^ at least in 
principle. In fact, 

7 Instead, the idea of inverting the smearing matrix (assuming it square and not sin- 
gular) is wrong in principle (i.e. besides the ascertainment that such a method yields 
unacceptable results). In fact, unfolding is a probabilistic problem and not a determin- 
istic linear one (or, equivalently, a geometric problem of rotating vectors). Therefore it 
needs to be solved by probabilistic tools and not by linear algebra methods ('rotations'). 
It is easy to show that 'rotation' works only when we have an 'infinite' number of events, 
such that stochastic effects are negligible and the observations coincides with the expec- 
tations. In fact, each product Xjix(d) is nothing but the expected number of events in 
the effect-cell Ej due to cause d alone: 

E M^)L(cJ = P(Ej\a, i) x (Ci) = \ji x(d) 

Summing up the contributions from all cause-cells, we get the expected value in the effect- 
cell Ej 

E\x(Ej)\ x J = Y^XaxQd), 

i 

that can be written in matrix form as 

fj, E = E[x E ] = Ax c . 
Then, if A is square and not singular, we get 

xc = A~ X /i B . 

But this might be, at most, the solution of a text book exercise in mathematical probability 
theory, and does not help to solve real problems. This is the reason why, besides the fact 
that the matrix inversion gives notoriously bad results, the very idea of inverting the 
smearing matrix is logically flawed: we can certainly apply A -1 to a vector of numbers 
already known to be sums of expected values of binomials, but we cannot apply it to a 
vector of numbers that might be (we are not even sure of this, because some counts could 
be due to background we do not take into account!) sums of binomial random variables. 
If we do it, there is no guarantee that A~ 1 xe yields a vector of 'valid numbers' of the 
n-parameters of binomials (the question that they might have the meaning of a physical 
spectrum for the problem under study is secondary at this level) and, in fact, even negative 
numbers can be obtained! It follows that unfolding methods which use the matrix inversion 
as starting point and try to cure its bad features with some kitchen are not appealing from 
a theoretical point view, although they might even provide acceptable results because of 
'mysterious' reasons I do not want to enter into (cooks might be extremely clever!). 
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1. Bayes' theorem allows to calculate P{xq \xe, A, /), given the obser- 
vation xe and the smearing matrix A, as 



P(x c \x E , A, I) 



P(x E \x c , A, I).P[x c \I) 



(3) 



Ex c P(xe\xc, A, I)-P{x c \I) 



(the formula will be explained in a while). 

2. We can take into account of the uncertainty about A using another 
theorem of probability theory, namely 



Let us now go through the several ingredients needed to get P(xc \xe,I) 
and try to understand were the problems arise. 

• First of all, it is easy to realize that the denominator of Eq. ([3]) is just 
a normalization factor, and then we can rewrite the Bayes' formula in 
a way that highlights the main ingredients: 



where P(xe \ xc, A, /) is the so called likelihood and P(xc \I) the 
prior. The left hand side of the Bayes' formula takes the name of 
posterior. As we see, in probabilistic inference the likelihoods have the 
role of updating probabilities. 

• Although the presence of priors in the formula might cause anxiety in 
those who approach this kind of reasoning for the first time, it is a mat- 
ter of fact that: 1) priors are logically necessary to get P(xq \ xe, A, I) 
starting from the likelihood, i.e. to perform the so-called 'probability 
inversion'; 2) they allow to plug into the model all relevant prior in- 
formation, that might come from previous data or from theoretical 
prejudices; 3) priors are often so vague (or there are so many data - 
that is the same for the relative balance of prior and likelihood in the 
Bayes' formula) that they influence negligibly the posterior, and the 
inference is often dominated by the likelihood. 

• Let us assume this last case of prior vagueness applies. This is equiv- 
alent to have 





(4) 



P(xc | xe, A, I) oc P(x E | x c , A, I) ■ P(x c | J) 



(5) 



P(xc 1 1) = constant 



(6) 



S 



and the inference is then performed according to the rule 



P(x c | x E , A, I) (x P{x E | x c , A, I) . (7) 

It is then not a surprise that the most probable spectrum xq is the 
one which maximizes the likelihood, if we have no other good reason 
to believe that this is not the case. (This is the meaning of the expres- 
sion "recovering maximum likelihood estimators" from the Bayesian 
approach.) 

• If we were able to write down a closed expression for the likelihood, 
or at least to provide a simple algorithmic expression of it, we could 
somehow scan all possible spectra xc, find the most probable one 
(or an 'average' spectrum that summarizes in some sense the variety 
of true spectra compatible with the data) and assess somehow the 
uncertainty about the result. But, unfortunately, this is not the case 
with P{xe I xc, A, I) of our interest. Let us see why. Given a certain 
number of events in a cause-bin x(Ci), the number of events in the 
effect-bins, included the 'trash' one, is described by a multinomial 
distribution (see Appendix A.l): 

F(x £ |,(C,),A,/) = -^—Ylxf'K (8) 

It follows that P(xe \ A, I) is the sum of independent multinomial 
distributions, for which, unfortunately, a closed formula does not ex- 
istH This is the real serious technical problem that prevents a straight 
application of the Bayes' formula. 

• The elements of the smearing matrix A are obtained by MC: we 
generate a large number of events in each cell C{ and count 'where 
they end' after a realistic simulation. Intuitively, we expect Xji « 
x(Ej) MC /x(Ci) MC , but we also know that this is just an estimate, 
with some uncertainty. Fortunately, in this case we can make direct 

8 It is well know that the sum of two Poisson variables is still a Poissonian. Similarly, 
the sum of two binomial variables having the same parameter 'p' is still a binomial. (This 
'reproductive property' applies to few other distributions). But the sum of two binomial 
variables with different p does not have a closed expression (this can be understood either 
intuitively, just thinking to the meaning of a binomial, or, more formally, analyzing the 
product of the characteristic functions of each binomial). The same happens with a 
multinomial, that is just an extension of the binomial to more than two possible outcomes. 
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use of the Bayes' formula applied to MC events, if we model the prior 
using a convenient pdf. In fact, if we indicate by Aj the i-th column 
of A (see footnote 4), i.e. 

Aj = A2,i, • • • , X nE + i : i} , (9) 

we have 

f[\\x^ c ,x(C i ) MC l I\ oc P[xz C I x(d) MC , \i, I] ■ f(\i 1 I) . 

(10) 

Since P[xg \ x(Ci) MC , Aj, I] is a multinomial, choosing a Dirichlet 
prior (an extension of the Beta distribution to many dimensions - see 
Appendix A. 2), we get a Dirichlet posterior (Appendix A. 3), i.e. 

Aj ~ Dir (otposterion ) i (11) 

with 

^posteriori ^-priori ^E \x(d) ' (1^) 

(Hereafter the symbols '~' stands for 'follows the probability distribu- 
tion'.) 

• Finally there is the integral ([3]), which is in principle not a problem 
(for example, thanks to modern technologies, it can easily performed 
by MC methods). 

In conclusion, the main serious problem is P(xq \ xe, A, /). Thus being 
unable to tackle the problem from the main door, we need some 'tricks' to 
circumvent the obstacle, but still under the guidance of probability theory. 

3 Practical algorithm to perform an independent- 
bin Bayesian unfolding 

The basic trick to avoid the mentioned difficulty is to apply Bayes' theo- 
rem to causes and effects, instead than to true and observed spectrum. In 
practice, instead of using of Eq. ([3]) we start from 

p(c\f n p(^-iQ,/)-mm , n * 
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x(E 1 ) 




x(E 2 ) 







x(E n , 



Figure 4: Sharing counts observed in effect-cells among cause-cells according 
to 9 lJ =P(C i \E J , I). 



or 



'Kl 



Xji ■ P(Cj 1 1) 



(14) 



having defined 6ij = P{C% \Ej, I) in analogy to Xji = P(Ej \ Ci, I) . 

At this point a very important remark is in order. The prior in Eqs. (|13p - 
(|14p has a different meaning from that of Eq. ([3]). In Eq. P(xc\I) 
assigns different probabilities to all possible spectra. Instead, in Eqs. (fT3|) - 
(I14p P(Ci | /) stands for a single spectrum (more precisely, all spectra that 
differ from each other just by normalization). This can be better understood 
analyzing the meaning of 'uniform' (or 'flat') referred to P(xc \ I) and to 
P{Ci\I). 

• P(xc 1 1) = constant means all spectra are equally likely. 

• P(C\ | /) = constant means we consider the cause bins equally likely, 
i.e. the prior assess an initial belief in flat spectra. 

In other words, while a flat P(xc \ I) means indifference about all possible 
spectra in order to 'let the data speak by themselves', a flat P(Ci \I) is a 
strong assumption that usually does not correspond to our priors concerning 
the physics case. This implies that we have to tune somehow the algorithm 
in order to take into account of this gross approximation. We will come back 
to this issue in Sec. HI 

At this point, having evaluated P(Ci \Ej, I), we can use it to share 
the counts observed in each effect-bin among all cause-bins (see Fig. [5]). 
The estimate of the true spectrum is obtained repeating this sharing for all 
observed bins and taking into account inefficiency. An uncertainty on the 
unfolded spectrum is also evaluated. Let us see how all this was done in the 
old algorithm and how it has been improved. 
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3.1 Old algorithm (Ref. [2]) 

In the old algorithm the share of observed events among the causes was done 
only considering expectations and applying an inefficiency correction, going 
through the following steps: 

• number of counts in Cj due to the observation in E~: 

<Ci)\ <Ej) « P(C i \E j ,I).x(E j ) = e ij -x(E j ); (15) 

• number of counts in d due to all observations: 

n E n E 

*(Pi)\ XB « Y j P{C i \E j ,I)-x{E j )=Y J G i yx{E j ) ] (16) 

3=1 3=1 



number of counts in C{ that also takes into account efficiency (e«): 

tie 

(, * < 



x(Ci) » - x(C i )\ Xa = -Y i P(C i \E ji r)-x(E j ) (17) 

4 3=1 

7 ^e ir x[E-) (18) 



where 



7=1 



n B n E 



a = J2P(E j \C i ,I) = J2 X 3i = 1 - X n E +l,i- (19) 
3=1 3=1 

• Finally, we have to remember that the several A« were estimated by 
MC simulation as 

Xji « x{Ej) MC /x(Ci) MC , (20) 
from which and 6*jj are calculated according to Eqs. (|19j) and (HH). 

Essentially, x{Ci) of Eq. (|17j) was considered an 'estimator', whose 'error' was 
calculated by standard 'error propagation' (this was a pragmatic compromise 
between 'standard methods' I was accustomed at that time and the Bayesian 
approach I was in the process of learning - anyhow, not bad to begin, and 
not worst than other methods.) 

At this point, there is still open the issue of the inappropriate priors, that 
we have encountered at the beginning of this section. Since this question 
remains in the improved algorithm, it will be discussed later in Sec. HI 
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3.2 Improvements 



As it is easy to understand from the previous description, weak points of 
the old algorithm are the treatment of small numbers and the calculation of 
uncertainty by 'error propagation formulas'. Let us see how we can improve 
them. 

• Instead of just 'estimate' Xij as x(Ej) MC /x{Ci) MC , we can model their 
pdf by a Dirichlet (see Appendix A): 

Xi ~ Dir[a prior + x^ c \ x{Ci)MC ], (21) 

where 

— OLprioy is the initial set of Dirichlet parameters - typically unitary 
for uniform priors (see Appendix B for details); 

— x e IC '\x(c-) mc s * an ds for the numbers of MC counts generated in 
Cj and which end in each of all possible tie + 1 'effects' (i.e. we 
also have to consider the inefficiency bin - see Fig. [2] - in order 
to satisfy the condition ^ • Aj = 1 of the Dirichlet variables) . 

• The uncertainty about A is taken into account by sampling its values, 
that is equivalent to perform the integral (jH)o Therefore, for each 
sampling, 

— the smearing matrix elements A™ are extracted according to a 
Dirichlet distribution; 

— the efficiencies = Y?j=i ^ji are calculated; 

— the inverse probabilities 9ij are obtained from the Bayes formula. 

• The sharing of the number of counts x(Ej) observed in an effect-cell 
among all cause-bins is performed using a multinomial distribution 
(see Appendix A.l): 

x c\ x{Ej ) ~ Mult[x(^), d ] , (22) 
with Oj = {6i j, 2 j, ... , nc j, }. 

9 Note that if we are in doubt about several smearing matrices, because e.g. they are 
obtained from different parameters of the simulation, the ('systematic') effect of this uncer- 
tainty can be taken into account sampling from the different A's, with weights depending 
on our confidence on each of them. 
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• The contributions of all observed effect-bins are summed up, i.e. 



X g\x E = Yj X o\x{E 3 ) ■ ( 23 ) 



"3 

3=1 

Inefficiency are taken into account as in the old algorithm: 

xc = (24) 

where the ratio in the r.h.s is done component by component. 

However, we need to remember that the observed number of events in each 
effect bin, x(Ej), comes from a Poisson distribution with unknown parame- 
ter jXj, whose inference can be conveniently performed starting from a con- 
jugate gamma distribution (see e.g. Ref. [6]). It follows that 

fj,j ~ Gamma[cj + x(Ej), rj + 1] , (25) 

where c,- and rj are the initial parameters of the gamma. [The case of a flat 
prior corresponds to cj = 1 and rj — > 0, i.e. an exponential with vanishing 
rate.] 

Therefore, what should be shared among the cause-bins is not x(Ej) 
but /ij, extracted in each sampling according to Eq. ([25]) . However, fj,j is 
a continuous variable and therefore cannot be used as 'trials parameter' of 
a multinomial distribution. Nevertheless, fractional events can indeed be 
shared making use of a little trick: 

• fij is extracted according to Eq. (j25|) and it is rounded to the closest 
positive integer, that we indicate here by rrij: rrij will be the number 
of trials used in the multinomial random generator. 

• xc\ m . is extracted and then rescaled by the factor Hj/mj, i.e. 

x c \ m . ~ Mult (my, Oj) , (26) 
*cL. = — *c\ m . ■ (27) 

This means that Eq. (|23|) has to be replaced by 

TIE 



E x ^.; (28) 
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then the inefficiency corrections are applied as usual. 

Each sampling provides a spectrum x®, where t is the 'time' index of 
the sampling. After N samplings we can calculate an average spectrum, 
variance and covariance for each cause-bin, as well as covariances among 
bins, and any other statistical summary; or we can inspect graphically each 
x{d). 

This algorithm has been implemented in the R language [I] and it is 
freely available on the web [5]. The R language has been chosen because 
it is a powerful scripting language^ open source, multi-platform, oriented 
towards statistics applications, well maintained and with tons of contributed 
extension packages. In practice a kind of programming lingua franca. 

In order to test the algorithm, also a simple event generator has been 
included, that reproduces the toy models of Ref. [2]. Some results will be 
presented in Sec. [5] 

4 Iterations and smoothing 

Let us now go back to the issue of the priors, that we have left open from the 
beginning of the previous section. The crucial thing to understand is that, 
as we stated above, instead of using a prior fiat over the possible spectra, we 
are using a particular, flat spectrum as prior. Therefore, the posterior [i.e. 
the ensemble of x^) obtained by sampling] is affected by this quite strong 
assumption, that seldom holds in real cases. 

Obviously, the first idea a practical physicist has, in order to fix the 
problem, is to do some fine tuning. In fact, simulations on toy models show 
that the unfolded distribution reproduces rather well the true one, even with 
a flat spectrum as prior. However, it still 'remembers the flat prior'. This 
effect can be cured iterating the procedure, using the posterior as prior in a 

10 I take the opportunity to make a point about the teaching of scripting languages, 
and in particular of R, in the physics courses. Surely physics students might need to 
learn C (and perhaps CH — h) at some point, but it is a matter of fact that writing some 
C code to solve little/medium problems that occur everyday, and that might also need 
some graphics, is a pain, not only for students. The consequence is that students usually 
do not write little programs in C to solve the problems they meet in the general physics 
or laboratory courses, continuing to use, instead, spread-sheets, learned in high school 
(and forget real programming until they need it for their theses, if they ever will need 
it). In my opinion, teaching a scripting language like R, perhaps in parallel to C, would 
be a good solution, (but if I really had to chose, I would opt for R, as the language to 
begin). (Personally, since I have discovered scripting languages, I use C only for though 
professional tasks. Indeed, I have even almost abandoned Mathematica, unless I really 
need symbolic calculation.) 
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subsequent unfolding. Empirically we learn that, in 'normal' cases, just two 
or three steps are sufficient to recover quite accurately the true spectrum. 

As it has been discussed in length in Ref. [2], we cannot repeat the 
iterations for a very long time, otherwise there will be a kind of positive 
feedback, driven by fluctuations, that makes the asymptotically unfolded 
spectrum 'crazy' (it simply means that it is far away from all reasonable 
expectations). Again, as we understand it, it is a question of judging the 
outcome by our physics priors0 that, although rather vague, tell us that, 
given the kind of measurement, a spectrum with wild oscillations is far from 
our beliefs. This kind of problem happens also with other kind of unfolding 
algorithms and it is usually cured by regularization. 

Regularization is based on the subjective scientific prejudice that 'wild' 
distributions are not physical (and, as the reader knows or might imagine, 
this very reasonable subjective ingredient is unavoidably also embedded in 
all non-Bayesian methods). In practice, regularization can be implemented 
constraining adjacent bins not to be 'too discontinuous', for example limiting 
the absolute value of the derivatives calculated numerically on the unfolded 
spectrum. Or one might fit the spectrum with some orthonormal functions, 
but dumping 'high frequencies'. And so on. 

I do not have a strong opinion on the matter. My recommendation is 
that one should know what one is doing. My preferred regularization method 
for one-dimensional problems consists in smoothing the posterior before in- 
jecting it as prior of the next iteration (but not after the last step!). This 
smoothing can be performed by a low order polynomial fit, as it is imple- 
mented in the demo scripts accompanying the program. I find this technique 
quite simple, well consistent with the spirit of this unfolding method and 
with the reasons to go through the iterations. Moreover, it is easy to under- 
stand that the procedure unfolding-smoothing-unfolding converges rapidly. 

In conclusion, although the idea of iteration does not seem very Bayesian 
(we cannot squeeze the same data twice!), it is in fact just an adaptive way 
to recover what could have been obtained by a more reasonable uniform 
prior over the possible spectra. (Stated with other words, it would be non- 
Bayesian to stick to the first iteration, obtained from a prior that rarely 
corresponds to physics spectra!) For this reason the idea of iteration has 
been kept, although some effort has been done to get rid of it. In fact, in 
the new code the user can, instead of providing a (typically flat) spectrum 
as prior, set up a function [priorf () ] that returns a normalized spectrum 

11 Good physicists do have priors and always use them! (Only the perfect idiot has no 
priors.) 
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chosen at random according to a probability distribution describing prior 
knowledge. In this way, in each sampling also the prior is sampled. However, 
I anticipate that implementing the function priorf () might not be that 
trivial, and iterative procedure still remains a pragmatic compromise to 
slalom among the difficulties. 

5 Results on toy models 

The program has been checked with the same toy smearing matrices used in 
Ref. [2], reproduced here in Tab. [TJ The results of the simulations are given 
in Fig. \E\ The generated distributions are shown in black. The 'measured' 
distributions (in red) look completely different from the 'true', due to the 
very severe smearing matrices used. The figures also show the results after 
the first iteration (light blue) and the intermediate ones (yellow). Note that, 
although in all simulations twenty iterations have been performed, only a 
few intermediate iterations are visible, because the others overlap with the 
final step. This allows you to get a feeling about the speed of convergence 
of the algorithm, depending on the difficulty of the problem - let us remind 
that, thanks to the intermediate smoothing the algorithm does converge. 
Other independent simulations are reported in Figs. [6] and 

6 Conclusions 

The evaluation of uncertainties of the Bayesian unfolding of Ref. [2] has 
been improved taking the probability density functions of the quantities of 
interest. Simplifications are achieved using prior conjugates and performing 
the relevant integrations, needed to propagate uncertainties, by sampling. 

The paper also discusses the issue of the iterations and the role of the 
intermediate smoothing to reach fast convergence. It is important to note 
that, contrary to other algorithms, the intermediate smoothing acts as a 
regularization on the priors and not on the unfolded spectrum. For this 
reason physical peaks should still appear in the unfolded spectrum. 

The R code of the algorithm is available, together with a little simula- 
tion script to run it on toy models. This script also implements of simple 
intermediate prior regularization, but the reader should have clear in mind 
that this task is left to the user, who is supposed to understand well his/her 
physics case. 
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Table 1: Smearing matrices P{Ej \ Ci) of the toy models 
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Figure 5: Result of unfolding applied to four toy models (data.func param- 
eter from 1 to 4 in the demo program ) and two different smearing matrices 
(left and right figures correspond, respectively, to 'Smearing 1 ' and 'Smear- 
ing 2' of Tab. njj. 
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Figure 7: Same as Fig. Independent complete simulation. 
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Appendix A — Multinomial and Dirichlet distribu- 
tions 



A.l Multinomial distribution 

The Multinomial distribution is the extension of the Binomial from 2 to /c 
possible outcomes. If we assign the probability pi to the i-th outcome and 
think of n trials, then we are interested in the joint probability to observe 
xi times the outcome 1, x 2 times the outcome 2, and so on. From basic 
rules of probability theory and some combinatorics we get 

ft \ 

/[x|Mult(n,p)] = — p-p jpTpT-'-P?, (29) 

X\\ X 2 \ ■ ■ ■ Xk ] - 

with 

x = {xt,x 2 , x k } [xi = 0, 1, . . . ,n; ^ Xj = n } 

i 

P = {Pl,P2, ■ ■ ■ , Pk} [0<p<l; ^2 Pi = l}. 

i 

It is easy to see that the binomial distribution is recovered for k = 2. Note 
that, given the constraint Yli x i = n ? the multinomial has to be seen as a dis- 
tribution for k — 1 variables xi,x 2 , . . . , x^—i, as it is clear from the binomial 
case. Expected values, variances, covariances and correlation coefficients are 
given by 

E(Xi) = n Pi (30) 
VarpQ) = n Pi (l- Pi ) (31) 
CoviX^Xj) = -np iPj [i^j] (32) 

f*Xi,Xi) = 1 n ~ P l Pj n =r Mi]- (33) 
y/Pi (1-Pi)pj (1 - Pi) 

Each marginal, i.e. f[xi | Mult(n, p)], is a binomial with parameters n and 
Pi- 

/[xi|Mult(n jP )] = - p?(l- Pi ) n - x '. (34) 

Xi ■ \Tl Xi ) 

For k = 2 the two variables X\ and X 2 are 100% anti-correlated, as it is 
quite obvious (the knowledge of X\ determines X 2 , and vice versa). In fact, 
since p 2 = 1 — p\ , we get from Eq. ([33]) 

p(X 1 ,X 2 \k = 2) = ~ Pl(1 ~ Pl) _ = -l. (35) 

y/pi (I- Pi) (1 -PI)PI 
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A. 2 Dirichlet distribution 



In this case we start from the formal definition. Given k continuous variables 
whose value can range between zero and one and that sum up to unity, the 
Dirichlet distribution is defined by the following joint pdf: 



/[jc|Dir(a)] 
with 

x 



r(ai + a 2 + • • • + Q fc ) ai _! a2 _i 



X 



afc-1 



T{a 1 )T{a 2 )---Y{a k ) JJl ^ 
{xi,x 2 , ■ ■ ■ , x k } [0<Xi<l; y^Xj = l 



(36) 



ol = {ai,a 2 , . . . , a k } 



[cti > 0], 



where T( ) is the usual gamma function. If k = 2, then the Beta is recov- 
ered. Also in this case there is a constraint on the variables, Yli=i x i = 1; 
that makes the distribution in fact (A; — l)-dimensional. Expected values, 
variances and covariances are 



a, 



a 



Var(^) 
Cov(X t ,X j ) 



ai (a - Oj) 
a 2 (a + 1) 



(37) 
(38) 
(39) 



a 2 (a + 1) 

where a = Yli a %- Each marginal is a Beta with parameters r = ai and 



a — a,-: 



f[xi\Beta(r = ai,s = a - ai)} = 



x 



a.i—1 



{l-Xi) 



\Q — Oj— 1 



(40) 



/?(«i, a - aj) 

where /?() is the beta function (hence the name to the distribution), defined 



as 



f}{r, 8 )= f z r ~\l 
Jo 



\s-l 



dz 



r(r)r( s ) 

T(r + s) 



(41) 



Note that the Beta pdf vanishes for X{ = if r > 1, and for x« = 1 if s > 1, 
respectively. This observation will be important for some considerations 
concerning the treatment of bins with zero counts (see Appendix B.l). 
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A. 3 Dirichlet as prior conjugate of the multinomial 

If the data sample x, modelled by a multinomial distribution, has been 
observed in n trials and we are interested in inferring the multinomial pa- 
rameters p, applying Bayes' theorem we get 

/[p|Mult(n), x] oc f[x\Mult(n,p)} ■ f(p) (42) 
oc p^ P f--- P x k k ■ f(p), (43) 

having absorbed in the normalization all factors not depending on p. If we 
model the prior f(p) with a Dirichlet, i.e. f{p) oc p c f 1 ~ 1 p^ 2-1 ■ ■ ■ Pk k ~ 1 > we 
get 

/[p|Mult(n,*)] (x (pTp?---pT) ■ (PT^PT' 1 ■ ■ ■ pT' 1 ) (44) 
oc p^- 1 pf +x *- 1 . . . p f+^- 1 , (45) 

that is still a Dirichlet. In practice, the observation x updates the Dirichlet 
parameters according to the rule 

^posterior ~ a prior x ■ (46) 

For this reason the Dirichlet is known as the prior conjugate of the multi- 
nomial. In the simplest case of k = 2 we have the Beta prior conjugate of 
the Binomial (see e.g. Ref. [B]). In particular, we can easily see that a flat 
prior is recovered if all a's of the Dirichlet prior are equal to one. 



Appendix B — Handling the zeros 

Bins with null counts might occur either in the Monte Carlo results used to 
infer the smearing matrix, or in the observed spectrum. As it is true that 
when we deal with large numbers we can neglect fluctuations and uncer- 
tainties, while small numbers are somewhat problematic, it is also true that 
handling zeros is particularly challenging and a physicist should mistrust 
abstract mathematical arguments of any kind. In fact, when dealing with 
zeros, subjective prior knowledge becomes crucial, because we need to have 
an idea of 'what a zero might mean': for some bins we could reasonably 
think that, if we would repeat the same experiment or the same simulation, 
the zero might turn into one count, or even in two o three; for other bins, 
instead, we are quite confident that we would need to repeat the experiment 
or the simulation many many times before a zero turns into one count, or 
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perhaps we have good reasons to believe this will never happen. In other 
words, experienced physicists think that 'not all zeros are the same'. 

Take, for example, an histogram in which, after some bins with positive 
(and perhaps quite large) counts, suddenly there is a row of empty bins 
(like the second row of Tab. [5]). Intuitively we do not think all the empty 
bins are equivalent, although for a mathematician their are. But we are 
physicists, and we have formed some opinion about how nature behaves. As 
a consequence, we do not only tend to smooth the positive numbers of the 
histogram, but we also tend to make some extrapolations to the region of 
zero counts, and judge the far zeros to be 'more zero' that the near ones. 
We need then to model somehow our opinion based on past experience. 

In the following subsection we shall see, in a one-dimensional case, how it 
is possible to treat in a consistent way zeros that occur in the MC simulation 
needed to evaluate the smearing matrix, where it might be really relevantly 
Indeed, as discussed in section H] about smoothing, this is a task left to the 
user, which knows the physical meaning of the bins[~^ 



13 



B.l — Zero counts in effect-bins of MC events used to infer 
smearing matrix 

When we make Monte Carlo simulations in order to infer the smearing ma- 
trix we usually observe many zeros, because long range migrations are usu- 
ally rather rare in typical detectorsl^l 



12 Indeed, as it it is common experience, 'row' smearing matrices (i.e. not yet expressed 
in terms of Dirichlet parameters) evaluated from MC simulations have many bins with 
zero counts (in most 'sane' situations most entries are null). Instead, there is not such a 
similar problem in the observed spectrum, where the best pragmatic solution is re-binning, 
in order to avoid zeros and even very small numbers. However, it is possible to imagine 
cases in which one is interested to keep one or a few bins separated from the others, even 
no counts have been observed in it /them. These special bins can be treated ad hoc, in 
a way similar to that discussed in the following subsection, playing on the parameters of 
the relevant gamma prior. 

13 But don't worry: the program gives you the option of treat the zeros as . . . zeros, and 
you might want to skip the following two subsections (that might be a bit 'academic' and 
of little practical relevance in most cases). But you might want to reflect a while that, 
perhaps, considering, democratically, all zeros on equal foot might be against your beliefs, 
while probably you do not want to be incoherent. . . 

14 Nevertheless, when we are interested in reconstructing physics quantities computed 
on an ensemble of tracks and clusters, and the detector has not negligible inefficiencies, 
the migrations can be huge, has it was used to happen in 'two-photon physics' in e + -e~ 
colliders, or in some kinematical regions of deep inelastic scattering, as impressively shown 
in Fig. 6.8 of page 77 of Ref. [10] (the choice of this reference is that similar figures, rather 
popular at the beginning of HERA physics, are usually only available on printed notes 
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As explained in Sec. [2j the columns of the smearing matrix are jointly 
described by vectors of Dirichlet random variables (one vector per cause- 
bin). The parameters of a Dirichlet pdf are updated according to Eq. ([35]). 
that we rewrite here: 

'-"■posterior ~ a prior x > 

where x stands here for x e \ x (c-) t see ^ usual choice of the 

Dirichlet prior to model indifference is to set all a's to 1. The prior is then 
characterized [see Eqs. (j4*0|) . (f37|) and ([55]) ] by the the following marginals, 
expectations and variances (we use the symbol p for the variables, for their 
physical meaning in our contest, and the index j for numbering them, as we 
do in the text for the effect-bins): 

/MBetafl,,-!)] - (47) 
Efe) = - (48) 

V 

Varfo) = — — >1, (49) 

where, hereafter, f is used in place of tie + 1 in order to improve the read- 
ability of the formulas. In the case of v = 2 we recover a flat prior and hence 
an expectation of 1/2. 

If we throw n MC events in order to estimate the several pj, the updated 
a parameters become atj = Xj + 1. For the marginals we have now: 

f\pj\Beta{xj + l,n + u-Xj-l)] = J ) F] -, (50) 



f3(xj + 1, n + v — xj — 1) 



with expected values and variances 



(xj + 1) (n + v 

(n + v) 2 (n + v + 1) 



varfe) = ^. ' . (52) 



In the limit of large numbers (n>v and Xj ^> 1) we get an expected value 
equal to Xj/n (that is the fraction of events in the cell j, in agreement with 
a naive evaluation of the elements of the smearing matrix) , with a variance 
of (xj/n) (1 — Xj/n) j n. 



and conference proceedings, of difficult accessibility). 
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In the case zero counts and large n the pdf is given by 

f(p j \x j = 0) = (n + l)(l-p j ) n , (53) 

having a sharp peak at pj = 0, with sharpness growing with n. Expected 
number and standard deviation are E[pj | Xj] = a\pj \xj] = 1/n [moreover, 
in this case the cumulative has the easy expression F(pj | Xj = 0) = 1 — (1 — 
Pj) n+1 i from which we can calculate a 95% probability upper limit for pj to 
be 1 — 0.05 1// ^ n+1 - ) that is approximately 3/n for large n]. 



Solving a formal paradox with some good sense 

The above conclusions seem reasonable when applied to an individual empty 
bin. But let us see, with the help of Tab. [2j what happens where several 
elements of x are null: one hundred events have been generated in a cause- 
cell and the second row of Tab. [2j whose elements are indicated by x(Ej), 
shows the counts in each effect-bin. The third row shows the prior a and 
the forth one the updated a. We also calculate in the following two rows 
expected values and standard deviation of the pj values. 

All elements of the smearing matrix that connect that cause-cell the 
effect-cells with zero counts have E\pj] = o-\pj] 0.9%. This looks rea- 
sonable if we consider each cell individually. But, observing the values of 
E[pj] and cr\pj] in the fifth and sixth row of Tab. O any physicist would 
be uneasy, for a couple of reasons. First, we are have strong prejudices 
towards regularity of physical laws, including whatever causes the smear- 
ing. If the smearing is peaked around bin E% and drops in both sides, we 
do not expect the smearing probability p\2 = P{E\2 \ Ci , /) to be equal to 
p-j = P(Ej | Ci ,/). Second, if we sum up the probabilities of all bins with 
no counts, i.e. X]j=7E[Pj], the total is not really negligible. The situation 
becomes worse if we add more bins in the right side. 

Another way to understand the problem is starting from the probabil- 
ity density function of pm> the parameter of a binomial distribution that 
describes the occurrence of an event in the set Ad of m selected bins (not 
necessarily adjacent, although in Tab. [2] they are). Just extending a well 
known property of the Dirichlet distribution [see Appendix A. 2, and in par- 
ticular Eq. P01 ]. we get that pm is described by a Beta pdf, with parameter 



a M = ^2kcM a k an d ol — aju, where a = Y^j a f 

f\PM I Bet&(a M , a - a M )\ = M . (54) 

p{aM, a - otM) 
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1/6 




10.2 


20.2 


40.2 


15.2 




10.2 


5.2 


0.2 


0.2 


0.2 


0.2 


0.2 


0.2 


f 


1 


1 


1 


1 




1 


1 


1 


1/2 


1/4 


1/8 


1/16 


1/32 


(rSzr) 
®post 3 * 


10.2 


20.2 


40.2 


15.2 




10.2 


5.2 


0.5 


0.3 


0.13 


0.06 


0.03 


0.015 




10.0 


19.8 


39.4 


14.9 




10.0 


5.1 


0.5 


0.2 


0.1 


0.06 


0.03 


0.02 


v\Pj] 


3.0 


3.9 


4.8 


3.5 




3.0 


2.2 


0.7 


0.5 


0.3 


0.2 


0.2 


0.1 






E 


pi-e] -■ 


= 99.0% 








E[p7-i 2 ] = 1.0% 








a 


Pi-e] ~- 


= 1.0% 












cr[pi-V2 


] = 1.0% 





Table 2: MC events, Dirichlet parameters a and expectations of Dirichlet 
variables pj (value in %) with several treatments of the zeros (see text). 
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If we take A4 to be the set of the m bins with zero counts (m = 6 in the 
case of Tab. [2]) and for each of them we choose the initial value of a to be 
equal 1, we get: 

f\PM I Beta(m, n - m)] = — —, r . (55) 

p{m, n — m) 

Contrary to Eq. (|53p . this pdf vanishes for p_M — > 0, thus stating we are a 
priori sure that pm cannot be zero. This is a bit too much! We understand, 
then, that the required condition in order to have fipM = 0) = is to 
apply the following constraint to the q's: J2kcM ak — ^> ^at Decomes 
c*kcM — l/m in the case we consider all zero count bins on the same foot. 
Since the final a of the bins with many counts is influenced very little by 
the initial a, if this does not exceed 0(1), then we could simply require 
ctj ~ 1/v. But, just to recover the uniform prior when u = 2, which can be 
written as a Dirichlet with a\ = 1 and «2 = 1 (that is a Beta with r = 1 
and s = 1), our starting point will be to set a = 2/v for all bins. Then this 
value will be modified due to the constraint X]fcc.M ak = m l v f° r the bins 
with zero counts. The result of this strategy is reported in Table [2] in the 
rows just below "ay = 2/V'. 



Regularization of the smearing matrix prior 

There is still the problem that we do not believe all zeros are the same. We 
can solve it shaping in some way the a's in a suitable way. Obviously, in or- 
der to do this, we need to know the bin ordering in the parameter space, that 
was not taken into account till now. We show here a simple way of how this 
regularization can be performed in a 1-dimensional spectrum unfolding. We 
set all initial a's to Ijv in all bins that have non-zero counts and in the adja- 
cent ones. Then, as we move far away from the bins with non-zero counts, we 
decrease the value of alpha, for example with some geometric law (i.e. expo- 
nential decrease). In Tab. [2] we have done this exercise taking the geometric 
factor equal 1/2 (see the rows indicated by '/'). That is perhaps a too con- 
servative choice that still takes in too much considerations the tails (indeed, 
in the program the geometric factor is given as an option in the program, 
whose default value is equal to 1/e, i.e. a 'natural' exponential decrease that 
gives roughly one order of magnitude suppression every two bins). The a's 
shaped in this way are finally rescaled with the condition YlkcM ak = m / l/ - 
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To conclude, these are the steps made in the default practical method 
to handle zeros in 1-dimensional unfolding: 



• All bins: 

— set a.j = 2/u + Xj in all bins (a prj in the table). 

• Only bins with zero counts: 

— reshape the a's with an exponentially lawful 

— rescale the a's by the condition YlkcM ak = m / l/ 

["£5/ in the table l- 

The last rows of Tab. [2] show the results obtained applying this rule, which 
seems a reasonable compromise between the several pieces of information 
and beliefs as they have been stated here. Needless to say, you are free 
to adjust the treatments of the zero according to the best knowledge of 
your physical case. The program offers you the possibility to return your 
preferred a's using the function my . alphas () , or just to decide that zeros 
are simply zeros, and basta. 



A warning about the Monte Carlo checks of Bayesian methods 

A last comment about the outcomes of the simulations performed to prove 
the correctness of the above procedure is in order. In fact you might want 
to check the unfolding performing a large number of simulated experiments. 
Most likely you will arrive to the conclusion that the strategy to treat the 
zeros discussed above will provide in average a biased result. This is not 
a surprise, since you are setting some smearing probabilities to positive 
values, whereas they are most likely exactly zero in the simulation. This 
effect is discussed in section 10.6 of Ref. [6] and you should pay attention to 
it and make the simulation in the correct way, before stating that "Bayesian 
methods yield biased estimations" . 



B.2 — Zero counts in real data effect-bins 

Having explained in length a strategy to treat the bins with zero counts in 
the evaluation of the smearing matrix, one could thing of doing something 
similar with the empty bins of the experimental spectrum. However it is 

15 In the case of a sequence of zero counts bins surrounded on both sides by non-zero 
bins, the two exponential starting at each side are summed up, to mimic the fact that 
there might be an incoherent sum of the two tails; 
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clear that in this second case the idea of handling in a sophisticate way the 
zeros might sound purely academic, because an appropriate re-binning is in 
most cases the simplest pragmatic solution. 

Anyway, since a possible strategy follows from what it has been discussed 
in Appendix B.l, let us sketch it here briefly with the benefit of the inventory. 

As we have seen in section 13. 2\ the default initial c and r parameters of 
the gamma pdf [see Eq. (|25|) ] are 1 and 0, respectively, corresponding to an 
improper flat prior (extending to infinite, with infinite expected value). The 
Bayesian updating described by Eq. (I25|) makes the posterior (mathemati- 
cally) nice (i.e. proper) and (physically) meaningful. In particular, empty 
bin j yields 

fl'j ■>■{!■: j ) 0] = e-W, (56) 

having null mode and unity expected value, that is in order. 

The problem occurs when there are several empty bins, because if we 
treat them all in this way, we get that the pdf of the sum of the respec- 
tive jU's (let us call it X) vanishes when its value goes to zero. Again, a 
possible solution is to constrain to zero the mode of S. This can be done 
playing with the c and r parameters of the empty bins, as done, for ex- 
ample, by the custom function (real cases require intelligent inputs by the 
user!) my. gamma. par () of the distribution code [5]. [Note that, contrary to 
my. alphas (), which tries to treat with a certain continuity adjacent zero 
bins, my. gamma. par () is more primitive, because, as said above, it is not 
considered important for practical applications.] 
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